Using meta-analysis and CNN-NLP to review and classify the medical literature for normal tissue complication probability in head and neck cancer

Purpose: The study aims to enhance the efficiency and accuracy of literature reviews on normal tissue complication probability (NTCP) in head and neck cancer patients treated with radiation therapy. It employs meta-analysis (MA) and natural language processing (NLP).
Material and methods: The study consists of two parts. First, it employs MA to assess NTCP models for xerostomia, dysphagia, and mucositis after radiation therapy, using Python 3.10.5 for statistical analysis. Second, it integrates NLP with convolutional neural networks (CNN) to optimize literature search, reducing 3256 articles to 12. CNN settings include a batch size of 50, an epoch range of 50–200, and a learning rate of 0.001.
Results: The study's CNN-NLP model achieved an accuracy of 0.94 after 200 epochs with Adamax optimization. MA showed an AUC of 0.67 for early-effect xerostomia and 0.74 for late-effect xerostomia, indicating moderate to high predictive accuracy but with high variability across studies. Initial CNN accuracy of 66.70% improved to 94.87% after tuning the optimizer and hyperparameters.
Conclusion: The study successfully merges MA and NLP, confirming high predictive accuracy for specific model-feature combinations. It introduces a time-based metric, words per minute (WPM), for efficiency and highlights the utility of MA and NLP in clinical research.
Supplementary Information: The online version contains supplementary material available at 10.1186/s13014-023-02381-7.


Introduction
Advancements in radiation therapy techniques for head and neck cancer have significantly improved patients' quality of life [1]. However, potential complications such as dysphagia, xerostomia, and mucositis can hinder recovery and amplify adverse effects. Specifically, radiation-induced xerostomia substantially diminishes patients' well-being, leading to oral health issues and communication barriers [2].
To enhance the welfare of head and neck cancer patients, researchers are exploring innovative approaches, including artificial intelligence (AI) and predictive algorithms, to investigate potential risk factors for complications. This multidisciplinary research has produced a vast body of publications. For instance, a literature search using the terms "artificial intelligence" and "head and neck cancer" between 2013 and May 2022 yielded 734,207 related articles on Web of Science (WOS), indicating a marked upward trend.
Given the sheer volume of published literature, comprehensive understanding through traditional literature reviews becomes increasingly challenging. Therefore, systematic search and filtering methods are crucial. Optimized strategies involve meta-analysis (MA) for synthesizing literature information, quantitatively integrating high-quality data to create valuable annotated datasets, thereby providing robust quantitative evidence for clinical decision-making.
However, conducting an integrated MA is time-consuming and labor-intensive, particularly in literature screening [3]. Reviewers face the daunting task of sifting through a plethora of articles with varying degrees of expertise and clinical relevance. To enhance the efficiency and accuracy of MA, this study employs natural language processing (NLP) techniques. As a significant branch of artificial intelligence, NLP enables computers to understand human language and has proven its applicability across various domains [4]. Utilizing NLP can augment the quantitative capabilities of MA and automate the screening process, improving analytical efficiency while reducing human error.
NLP accelerates literature reviews by adeptly categorizing pertinent articles. Numerous studies have improved machine learning methods using publicly accessible literature from 15 systematic reviews [5][6][7][8]. For instance, Yujia et al. employed various machine learning models to classify abstracts into two categories related to cancer risk in genetic mutation carriers (penetrance) or the prevalence of genetic mutations [3]. They achieved over 88% accuracy in both models. Zhengyi et al. demonstrated that NLP-based methods could substantially reduce the review workload while maintaining the ability to identify relevant research [3]. However, to date, no NLP techniques have been specifically tailored for literature on complications following head and neck cancer radiation therapy or normal tissue complication probability (NTCP). Furthermore, there is a conspicuous lack of an annotated dataset for building a machine learning model dedicated to discerning relevant articles in this domain.
Our research aims to fill this gap by creating an annotated abstract dataset focusing on the likelihood of three common complications after radiation therapy for head and neck cancer: mucositis, xerostomia, and dysphagia. We will employ machine learning-based NLP methods to classify abstracts against this annotated dataset. The ultimate goal is to minimize human error and enhance analytical efficiency.

Research framework
Our research process, based on MA, is divided into two parts, as depicted in Fig. 1. The first part employs MA to investigate NTCP predictive models for three common complications after radiation therapy in head and neck cancer patients: xerostomia, dysphagia, and mucositis. The study encompasses patient demographics, methodologies, and outcomes, hypothesizing that significant variations may arise from different complication types, model choices, and predictive factors. By comparing various models and feature combinations, we aim to identify those with superior predictive capabilities, offering more effective prediction methods for clinical use. Statistical analyses are conducted using Python 3.10.5, with the null hypothesis stating that all model-feature combinations perform equally well in predicting complications, and the alternative hypothesis positing that at least one combination significantly outperforms the others.
The second part integrates natural language processing with convolutional neural networks (CNN) to enhance literature retrieval efficiency and result reliability. This approach aims to reduce the time required for research on the NTCP of complications in head and neck cancer, offering quicker and more reliable insights for future studies and clinical applications.

Eligibility criteria, information sources, and search strategy
This study outlines the research content on head and neck cancer patients using the PICOS framework [9] (patient characteristics, intervention measures, control group, outcome), as shown in Fig. 2. Patient characteristics focus on head and neck cancer patients; interventions encompass all radiation therapy techniques for treating this cancer; control groups are categorized into machine learning, deep learning model types, and feature factors; and the outcome metric targets the AUC of multivariate NTCP models. Given its non-RCT or CCT nature, the study falls under the category of prospective trials.
After formulating the research theme, database searches are conducted using relevant keywords, covering both titles and abstracts. Primary search keywords are organized into three layers (patient, method, and outcome) and are explored in conjunction with the PICOS framework. To ensure completeness, Boolean "AND" searches are specifically performed for combinations of complications with AI and NTCP. Beyond the PICOS framework, the study also employs PubMed's MeSH terms and related literature to broaden its scope. Boolean  S1) [10][11][12].
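The layered keyword search described above can be sketched in a few lines. This is a minimal illustration only: the keyword lists, the choice to join terms within a layer with OR, and the `build_query` helper are assumptions made for the example and do not reproduce the study's actual search strings.

```python
# Hypothetical keyword layers; the study's exact search terms are not reproduced here.
LAYERS = {
    "patient": ["head and neck cancer"],
    "method": ["artificial intelligence", "NTCP"],
    "outcome": ["xerostomia", "dysphagia", "mucositis"],
}


def build_query(layers):
    """Join terms within a layer with OR and join the layers with AND."""
    groups = []
    for terms in layers.values():
        quoted = " OR ".join(f'"{term}"' for term in terms)
        groups.append(f"({quoted})")
    return " AND ".join(groups)


query = build_query(LAYERS)
print(query)
```

Keeping each layer as a separate list makes it straightforward to add MeSH terms or synonyms to one layer without touching the others.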

Selection process
The data extracted from each included study are determined through collaborative discussions among reviewers. One reviewer is responsible for data collection, while another performs cross-validation. The data encompass authorship, publication year, types of complications, radiation therapy methods, employed models, features (prognostic factors), performance evaluation, and the study's contributions and conclusions.

Data extraction and risk of bias (RoB) assessment
In our study, when evaluating the quality and potential biases of the literature for MA, we opted for the PROBAST tool (Prediction model Risk Of Bias ASsessment Tool) over the commonly used Cochrane risk of bias assessment (RoB) tool. This choice was made because a significant portion of the included studies did not align well with the criteria of the Cochrane tool due to their unique characteristics.
PROBAST evaluates four domains: participants, predictors, outcome, and analysis. The participants domain assesses the representativeness of the target population and selection bias; the predictors domain evaluates the selection, relevance, reliability, and handling of predictive factors; the outcome domain focuses on the measurement and definition of outcomes, assessing their accuracy and consistency; and the analysis domain reviews methods for model development and validation, including sample size, missing data handling, model calibration, and discriminative ability.
Bias risk assessment is conducted using the PROBAST Excel interface developed by Borja M. Fernandez-Felix [13], with risk determinations (low, high, or unclear) derived from responses to signaling questions. An overall low risk is assigned only if all domains are low-risk; a single high-risk domain results in an overall high risk; and an unclear risk in one domain with low risk in others leads to an overall unclear risk. If all model domains are low-risk but the model lacks external validation, the risk is elevated to high; however, if the model is based on extensive data with internal validation, it can be considered overall low-risk.
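The aggregation rule described above can be written as a short function. This is a sketch of the rule as stated in the text, not the logic of the PROBAST Excel interface itself; the function name, the domain keys, and the two boolean flags are hypothetical names introduced for illustration.

```python
def overall_risk(domains, externally_validated=True, large_internally_validated=False):
    """Combine per-domain judgments ('low', 'high', 'unclear') into one overall rating.

    The two flags are illustrative assumptions for the validation caveat:
    an all-low model without external validation is rated high unless it is
    based on extensive data with internal validation.
    """
    ratings = [rating.lower() for rating in domains.values()]
    if any(rating == "high" for rating in ratings):
        return "high"      # a single high-risk domain dominates
    if any(rating == "unclear" for rating in ratings):
        return "unclear"   # no high-risk domain, at least one unclear
    if not externally_validated and not large_internally_validated:
        return "high"      # all domains low, but validation is insufficient
    return "low"


example = {"participants": "low", "predictors": "low", "outcome": "unclear", "analysis": "low"}
print(overall_risk(example))
```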

Statistical methods
The MA in this study primarily contains the following key components:

Natural language processing (NLP) program design
To expedite the identification and retrieval of relevant literature while ensuring result reliability and accuracy, this study adopts a CNN for NLP, drawing inspiration from Yujia Bao's MA NLP model design [3]. This choice not only considers the nature of the data but also facilitates platform development, paving the way for the future integration of more deep learning models to enhance the classifier's accuracy and generalizability. For abstract identification, the CNN model can automatically learn language features from extensive text and achieve good results across various tasks. Through word vector transformation and feature extraction, the CNN model performs text classification and sentiment analysis. Key parameters used in this study include a batch size of 50, an epoch range of 50-200, and a learning rate of 0.001.
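To make the feature-extraction step concrete, the following sketch shows how a single convolutional filter slides over a sequence of word vectors and is then reduced by max-over-time pooling. It is a pure-Python illustration of the general text-CNN idea, not the study's model: the toy two-dimensional embeddings, the filter weights, and the function name are all assumptions.

```python
def conv1d_max_pool(embedded, kernel, bias=0.0):
    """Apply one filter of width len(kernel) to a sequence of word vectors.

    Each window of consecutive word vectors is multiplied element-wise with
    the filter, summed, passed through ReLU, and the maximum activation over
    all positions is returned (max-over-time pooling).
    """
    width = len(kernel)
    activations = []
    for start in range(len(embedded) - width + 1):
        window = embedded[start:start + width]
        total = bias
        for vector, weights in zip(window, kernel):
            total += sum(v * w for v, w in zip(vector, weights))
        activations.append(max(0.0, total))  # ReLU
    # A sequence shorter than the filter yields no windows.
    return max(activations) if activations else 0.0


# Toy example: three words with 2-dimensional embeddings, one filter of width 2.
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bigram_filter = [[1.0, 0.0], [0.0, 1.0]]
print(conv1d_max_pool(sentence, bigram_filter))
```

In a full classifier, many such filters of different widths would be learned jointly, and their pooled outputs would feed a final classification layer.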

Data preprocessing
The data preprocessing in this study is divided into two main phases. First, abstracts and titles that have undergone manual retrieval and initial screening are allocated into training, validation, and test sets. The positive and negative samples in the training and validation sets are distributed at a 2:8 ratio, while the test set is adjusted to a more realistic 15:85 ratio to better reflect the prevalence of irrelevant samples. Second, for word vector embedding, the text is converted into JSONL format and manually annotated and cleaned, including the removal of potentially misleading punctuation and special characters. These preprocessing steps optimize the text for word vector embedding input in the CNN model, facilitating subsequent NLP and analysis.
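The cleaning and JSONL conversion can be sketched as follows. This is a minimal example under stated assumptions: the `text` and `label` field names, the lowercasing step, and the exact character filter are illustrative choices, not the study's documented pipeline.

```python
import json
import re


def clean_text(text):
    """Replace punctuation and special characters with spaces, collapse whitespace, lowercase."""
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip().lower()


def to_jsonl(records):
    """Serialize (title, abstract, label) records as one JSON object per line."""
    lines = []
    for title, abstract, label in records:
        row = {"text": clean_text(f"{title} {abstract}"), "label": label}
        lines.append(json.dumps(row))
    return "\n".join(lines)


def positive_fraction(labels):
    """Share of positive (label == 1) samples, e.g. 0.2 for a 2:8 split."""
    return sum(1 for label in labels if label == 1) / len(labels)


# Hypothetical record used only to show the output format.
print(to_jsonl([("NTCP models: xerostomia (late-effect)!", "An example abstract.", 1)]))
```

Checking `positive_fraction` on each split is a quick way to confirm that the intended 2:8 and 15:85 class ratios were actually obtained.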

Literature review and research selection
After searching the WOS and PubMed databases, this study initially identified 3,256 potentially relevant articles, as illustrated in Fig. 3. The first round of screening, based on titles, eliminated studies unrelated to head and neck cancer or radiation therapy, leaving 87 articles for the second round. The second round, focused on abstracts, further excluded studies not involving head and neck or squamous cell cancer patients, or those not utilizing machine learning or deep learning as evaluation tools, resulting in 36 articles for full-text review. During this phase, articles not addressing predictions, not focusing on complications, or lacking AUC-related outcomes for multivariate NTCP models were also excluded, along with duplicates. Ultimately, 12 articles were included for review [16][17][18][19][20][21][22][23][24][25][26][27].

Performance of the CNN-NLP model
After comparing nine different optimizers, our study opted for Adamax (see Additional file 1: Table S2). With 50 epochs, Adamax achieved a loss value of 0.51, an accuracy of 0.85, and an F1-score of 0.75, along with a precision of 0.71. When the epochs were increased to 100, the accuracy and F1-score improved to 0.87 and 0.79, respectively, while the precision reached 0.84. At 200 epochs, both accuracy and F1-score peaked at approximately 0.94, demonstrating the superior performance of the Adamax optimizer in the model.
After optimizer fine-tuning, as shown in Table 1, we evaluated coverage performance, which measures the overlap of identified studies under specific search subset conditions and assesses the efficacy of automated processing. We conducted tests on four different subsets, from WOS T1 to PubMed T4, and compared the coverage rates when using Adam and Adamax optimizers across training cycles of 200, 100, and 50 epochs. In WOS T1, coverage was generally 0/9 regardless of the optimizer or training cycle, with Adam reaching a peak of 1/7 and low recognition frequency. In PubMed T2, coverage was mostly 0/7, but a few articles were identified at epochs 100 and 50, not exceeding two in total. In WOS T3, Adam achieved a 3/4 coverage rate at 50 epochs, similar to Adamax. For PubMed T4, Adam reached a 3/4 coverage rate at 100 epochs, while Adamax showed more stable performance across all training cycles, peaking at 2/4.
For literature review speed, our study introduces words per minute (WPM) as a more objective method for time quantification. Beyond providing a standardized metric for future research, we also employ unit conversion and a deep learning-based natural language text classifier for temporal comparisons. In Table 2, we calculated and compared the time spent on alternative tasks, converting WPM results to seconds; the details of the screening speed measured in WPM can be seen in Additional file 1: Table S3. We then contrasted this with the average time needed for text recognition during preprocessing in the T1-T4 test sets using an Adamax-optimized CNN-NLP model. As shown in Table 1, despite considerations like text recognition capabilities, the time efficiency gained through NLP shows a significant, intuitive difference. (Code for the WPM calculation algorithm, captured from the monitor, is shown in Additional file 1: Figure S1.)
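For readers who want to reproduce these kinds of figures, the reported metrics and the coverage ratio can be computed as in the sketch below. The confusion-matrix counts and article identifiers are hypothetical, and the function names are illustrative; coverage is interpreted here as the number of manually included studies that the classifier also flagged, which is an assumption based on the description above.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def coverage(predicted_ids, included_ids):
    """Return (hits, total): manually included studies that the classifier also flagged."""
    included = set(included_ids)
    hits = len(included & set(predicted_ids))
    return hits, len(included)


# Hypothetical counts and identifiers, for illustration only.
print(classification_metrics(tp=8, fp=2, fn=2, tn=88))
print(coverage(["a", "b", "x"], ["a", "b", "c", "d"]))
```

On imbalanced test sets such as the 15:85 split, accuracy alone can look high even when few relevant articles are found, which is why precision, F1, and coverage are reported alongside it.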
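The unit conversion between WPM and seconds is simple arithmetic, sketched below. The word count and reading speed are hypothetical values, and the function names are illustrative; the study's own WPM code is in Additional file 1: Figure S1 and is not reproduced here.

```python
def words_per_minute(word_count, seconds):
    """Screening speed in words per minute for a text read in the given time."""
    return word_count / (seconds / 60.0)


def seconds_for(word_count, wpm):
    """Time in seconds needed to read word_count words at a given WPM."""
    return word_count / wpm * 60.0


# Hypothetical abstract of 250 words read at 200 WPM.
manual_seconds = seconds_for(250, 200)
print(manual_seconds)
```

Expressing both manual screening and classifier inference in seconds per abstract puts the two on the same scale for comparison.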

Features and model methods: systematic review
As shown in Table 3, the feature table of included studies aligns with the three dimensions of the MA issue discussed in our Materials and Methods section. In addition to the authors and publication years, the table also encompasses demographic characteristics, complications, types of radiation therapy techniques, algorithmic combinations in predictive models, predictive performance, and selected predictive factors. The systematic review ultimately included a total of 12 studies [16][17][18][19][20][21][22][23][24][25][26][27].
The forest plot is illustrated in Fig. 4. The present study undertakes a meta-analysis focusing specifically on predictive models for xerostomia. Utilizing the feature table, we integrated the models employed across various studies and further stratified them into early and late phases for sub-group analysis. The combined effect sizes for these sub-groups are visually represented through forest plots (the funnel plot is included in Additional file 1: Figure S2). The temporal demarcation for these phases was set at six months, based on the work of Hubert S. Gabryś [16].
The overall effect size for the area under the curve (AUC) of early-effect xerostomia models (Fig. 4a) was 0.67, with a 95% confidence interval (CI) ranging from 0.40 to 0.91. This indicates that these models possess moderate predictive accuracy for early-effect xerostomia. However, the high heterogeneity, as evidenced by an I² value of 80.32% and a Q-statistic of 5.34, suggests significant variability across different studies. For late-effect xerostomia (Fig. 4b), the overall AUC effect size was 0.74, with a 95% CI of 0.46 to 0.98. This result further corroborates the models' relatively high predictive efficacy for late-effect xerostomia. Nevertheless, the exceedingly high heterogeneity (I² = 97.99%, Q-statistic = 52.48) implies that the applicability of these models may be limited across different research settings or patient populations.
In Table 4, titled "Prediction model Risk of Bias in Included Studies," the output for each question represents distinct focal points of work, encompassing an evaluation of all critical stages in the development and application of prediction models as assessed by PROBAST. The assessment content is divided into four domains: 1. Participants, 2. Predictive Factors, 3. Outcomes, and 4. Analysis. These domains are further categorized based on three assessment outcomes, labeled as "High Risk," "Low Risk," and "Unclear or Ambiguous." Although the overall assessment reveals that only four studies exhibited low risk of bias in their data, with the remainder falling under high risk or unclear categories, in terms of applicability only two included studies were assessed as having a higher risk, while two were categorized as unclear or ambiguous. This suggests that while there may be a pervasive issue of data bias, the applicability of these studies is less frequently compromised, indicating a need for more rigorous methodological scrutiny to enhance the reliability and utility of future prediction models.
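The heterogeneity statistics quoted above can be illustrated with a standard random-effects calculation. The sketch below uses the DerSimonian-Laird estimator on hypothetical AUC values and standard errors; the source does not state which pooling method or transformation was used, so this is an assumption and the example does not attempt to reproduce the reported values.

```python
import math


def random_effects_pool(effects, std_errors, z=1.96):
    """DerSimonian-Laird random-effects pooling with Cochran's Q and I² (in percent)."""
    weights = [1.0 / se ** 2 for se in std_errors]           # inverse-variance weights
    sum_w = sum(weights)
    fixed = sum(w * y for w, y in zip(weights, effects)) / sum_w
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    c = sum_w - sum(w ** 2 for w in weights) / sum_w
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0          # between-study variance
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    re_weights = [1.0 / (se ** 2 + tau2) for se in std_errors]
    sum_re = sum(re_weights)
    pooled = sum(w * y for w, y in zip(re_weights, effects)) / sum_re
    half_width = z * math.sqrt(1.0 / sum_re)
    return {
        "pooled": pooled,
        "ci": (pooled - half_width, pooled + half_width),
        "Q": q,
        "I2": i2,
        "tau2": tau2,
    }


# Hypothetical study-level AUCs and standard errors, for illustration only.
result = random_effects_pool([0.60, 0.80], [0.03, 0.03])
print(result)
```

A large I² inflates the between-study variance, which widens the pooled confidence interval; this is consistent with the wide intervals reported for both sub-groups.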

Results of the MA study
In our study, we conducted a comprehensive retrospective analysis to evaluate AI-based predictive models for forecasting post-radiation complications like xerostomia in head and neck cancer patients. Our data revealed significant effect sizes of 0.67 and 0.74 for early and late-stage xerostomia, respectively, with p-values below 0.05, highlighting the distinctiveness of AI-based models in this context.
Interestingly, our findings contrast with earlier research by our team (Lee et al. [17,18]) and Van Dijk et al. [19]. We observed that incorporating image biomarkers, such as pre-processed CT data, did not necessarily enhance predictive accuracy compared to models solely based on traditional clinical factors and machine learning algorithms. This discrepancy may stem from variations in dataset composition and algorithmic parameters during model training and validation. Further, research by Gabryś et al. [16] identified key features like dosimetric shapes and salivary gland volume through algorithmic comparisons, reiterating the significant divergence between AI-based and traditional clinical models in xerostomia prediction.
However, our study also revealed certain limitations and challenges. Firstly, the limited scope of databases for literature search led to incomplete data and insufficient literature, restricting our ability to perform comprehensive meta-analyses and forest plot illustrations. Secondly, some studies lacked complete data, such as predictive confidence intervals, which further impacted our analysis. As with any other site, the CNS NTCP literature suffers from the same limitations, and no AI has yet been successfully implemented [28]. Overall, while our study made progress in predicting normal tissue complications after radiotherapy for head and neck cancer, further research and validation are needed. Our findings align with Chulmin Bang's 2023 literature review, emphasizing that the clinical application of AI models still requires more in-depth exploration and validation [29].

Performance of the CNN-NLP model, optimizer optimization, and coverage
In this study, we presented an analysis focusing on the coverage rate of imbalanced datasets. Despite optimizing the algorithmic parameters, we abstained from employing data augmentation techniques like oversampling or undersampling to bolster the model's predictive accuracy. Our text classification model was conceptualized based on the research framework proposed by Yujia Bao, MA [3]. This CNN-based model predominantly relies on abstracts rather than full texts for analysis. Consequently, the conversion rate of the included literature could be susceptible to variations in research themes and inclusion criteria, a limitation also acknowledged in Yujia Bao's work [3]. Nevertheless, recent advancements in large-scale language models such as GPT-3 and GPT-4 have shown capabilities in recognizing diverse file formats, including PDFs [30], and have exhibited remarkable precision in medical text identification [30,31]. Progress has also been made in the realm of deep learning for medical text analysis, exemplified by CNN-based medical report retrieval studies [32]. These technological strides open new avenues for medical text identification, potentially mitigating the aforementioned limitations. We are currently exploring the development of models designed for automated full-text reviews to further enhance the comprehensiveness and accuracy of literature analyses.

Conclusion
In this study, we employ an integrative approach combining MA and NLP to explore feature factors for NTCP in head and neck cancer. Our results reject the null hypothesis H0, confirming that specific model-feature combinations yield high predictive accuracy for identical complications. Utilizing CNNs in NLP, we streamline the meta-analytical process and introduce a time-based metric, words per minute (WPM) [33], for efficiency evaluation. This study underscores the utility of meta-analysis and NLP in clinical research, offering a methodological advancement for future studies aiming to optimize predictive models and operational efficiency.

Fig. 1 Research workflow diagram. CNN Convolutional neural networks, NLP Natural language processing, WOS Web of science, PICOS Patient characteristics, Intervention measures, Control group, Outcome

Fig. 2 Search framework. AI Artificial intelligence, HNC Head and neck cancer, NTCP Normal tissue complication probability, WOS Web of science

Fig. 3 Article selection flowchart. WOS Web of science

Fig. 4 Forest plots. a Overall effect size for the area under the curve (AUC) of early-effect xerostomia models. b Overall effect size for the AUC of late-effect xerostomia models

Table 1 Coverage results

Table 2 Time difference comparison between manual and NLP classifier approaches. CNN Convolutional neural networks, NLP Natural language processing, WOS Web of science

Table 3 Features for the included studies

Table 4 Prediction model Risk of Bias in included studies. High risk is denoted by "-"; low risk is denoted by "+"; unclear or ambiguous is denoted by "?"